Skills: Correlations and model fit
Week 3
This page demonstrates how to generate values for three different measures of model fit:
- Correlation
- R-squared
- Standard error of regression
I’ll start by loading the full dataset from last week’s assignment (the one where I had added some random noise to the outcome).
library(tidyverse)
library(here)
library(knitr)
full_data <- here("week2",
"full-data.csv") |>
read_csv()
Correlation
Correlation measures how closely the relationship between two continuous variables follows a straight line.
R
In R, you can calculate the correlation between all pairs of
variables in a data frame by using the cor() function.
Since we only want the correlations between pairs of continuous
variables, we’ll start by using the select() function to
choose just the variables we want to include in our correlation
table.
| | sq_feet | dt_dist | rent |
|---|---|---|---|
| sq_feet | 1.0000000 | 0.0072741 | 0.5911491 |
| dt_dist | 0.0072741 | 1.0000000 | -0.1842930 |
| rent | 0.5911491 | -0.1842930 | 1.0000000 |
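The two steps described above (select the continuous variables, then call cor()) can be sketched as follows. So that it runs on its own, this sketch uses a small simulated data frame, `toy_data`, as a hypothetical stand-in for `full_data`; the values it produces will not match the table above.

```r
library(dplyr)

# Simulated stand-in for full_data (hypothetical values, for illustration only)
set.seed(42)
n <- 200
toy_data <- data.frame(
  sq_feet = runif(n, 400, 2000),
  dt_dist = runif(n, 0, 10),
  color   = sample(c("Blue", "Green", "Red"), n, replace = TRUE)
)
toy_data$rent <- 200 + 0.7 * toy_data$sq_feet -
  50 * toy_data$dt_dist + rnorm(n, sd = 120)

# Drop the categorical variable, then correlate every pair of the rest
cor_table <- toy_data |>
  select(sq_feet, dt_dist, rent) |>
  cor()

cor_table
```

To format the result as a table on a rendered page, one option is to pass `cor_table` to kable() from the knitr package.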
Excel
In Excel, you can calculate the correlation between two variables
using the =CORREL() function.
R-squared
Another way to calculate a correlation is to estimate a model with a single predictor. The square root of the R-squared value will be the absolute value of the correlation between the predictor and the outcome (the sign of the correlation matches the sign of the predictor's coefficient). You can also use R-squared to describe the fit of a model with multiple predictors.
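As a quick check of the single-predictor relationship, here is a minimal sketch in R. The vectors `sq_feet` and `rent` are simulated here purely for illustration and are not the values from the assignment data.

```r
# Simulated predictor and outcome (hypothetical values, for illustration only)
set.seed(42)
sq_feet <- runif(200, 400, 2000)
rent <- 200 + 0.7 * sq_feet + rnorm(200, sd = 120)

# Fit a model with a single predictor
single_model <- lm(rent ~ sq_feet)

# Square root of R-squared from the model
r_from_model <- sqrt(summary(single_model)$r.squared)

# Correlation calculated directly
r_direct <- cor(sq_feet, rent)

# Compare the two, allowing for floating-point rounding
all.equal(r_from_model, abs(r_direct))
```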
R
In R, after you estimate a model using the lm()
function, you can use the summary() function to see a
summary of the results.
The R-squared value will be shown as Multiple R-squared:
towards the bottom of the summary.
##
## Call:
## lm(formula = rent ~ sq_feet + dt_dist + color, data = full_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -446.89 -85.17 -8.46 77.36 518.58
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.255e+02 1.453e+01 15.52 <2e-16 ***
## sq_feet 6.986e-01 8.256e-03 84.61 <2e-16 ***
## dt_dist -4.824e+01 1.775e+00 -27.17 <2e-16 ***
## colorGreen 4.737e+01 3.898e+00 12.15 <2e-16 ***
## colorRed -1.052e+02 2.723e+00 -38.64 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 123.8 on 9995 degrees of freedom
## Multiple R-squared: 0.5063, Adjusted R-squared: 0.5061
## F-statistic: 2563 on 4 and 9995 DF, p-value: < 2.2e-16
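A summary like the one above comes from fitting a model with lm() and passing the result to summary(). The sketch below uses the same formula as the Call: line in the output, but fits it to a simulated data frame (`toy_data`, a hypothetical stand-in for `full_data`), so its numbers will differ.

```r
# Simulated stand-in for full_data (hypothetical values, for illustration only)
set.seed(42)
n <- 200
toy_data <- data.frame(
  sq_feet = runif(n, 400, 2000),
  dt_dist = runif(n, 0, 10),
  color   = sample(c("Blue", "Green", "Red"), n, replace = TRUE)
)
toy_data$rent <- 200 + 0.7 * toy_data$sq_feet - 50 * toy_data$dt_dist +
  ifelse(toy_data$color == "Red", -100, 0) + rnorm(n, sd = 120)

# Estimate the model, then print the full summary
model <- lm(rent ~ sq_feet + dt_dist + color, data = toy_data)
summary(model)
```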
You can also just return the R-squared value on its own. This is useful if you are comparing the fit of multiple models and you don’t want to be tempted to select your preferred model based on model coefficients and their associated p-values.
## [1] 0.5063459
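One way to return only the R-squared value is to pull the `r.squared` element out of the summary object. The sketch below does that for two candidate models so they can be compared on fit alone; the data and the model names (`model_small`, `model_large`) are hypothetical.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(42)
n <- 200
toy_data <- data.frame(
  sq_feet = runif(n, 400, 2000),
  dt_dist = runif(n, 0, 10)
)
toy_data$rent <- 200 + 0.7 * toy_data$sq_feet -
  50 * toy_data$dt_dist + rnorm(n, sd = 120)

# Two candidate models: one predictor versus two
model_small <- lm(rent ~ sq_feet, data = toy_data)
model_large <- lm(rent ~ sq_feet + dt_dist, data = toy_data)

# Extract just the R-squared value from each summary
r2_small <- summary(model_small)$r.squared
r2_large <- summary(model_large)$r.squared

c(small = r2_small, large = r2_large)
```

Adding a predictor never lowers R-squared, so when the models differ in size, `summary(model)$adj.r.squared` may be the fairer comparison.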
Excel
In Excel, if you run the =LINEST() function to estimate
a regression model with its stats argument set to TRUE, the value in the
first column of the third row will be the R-squared value for the regression.
Standard error of regression
The standard error of the regression describes the typical size of a model's residuals, in the units of the outcome, and can be used to generate confidence intervals around a prediction.
R
In the output from the summary() function in R, the
standard error of the regression is above the R-squared value, and
labeled as Residual standard error:.
You can also pull out the standard error of regression directly by
referring to it as sigma.
## [1] 123.7943
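A minimal sketch of pulling out the standard error of the regression and using it for an interval around a prediction is shown below. The data are simulated and `new_home` is a hypothetical observation; the interval comes from predict() with `interval = "prediction"`, which is one way (not necessarily the one used in this course) to get a range for a single new observation.

```r
# Simulated data (hypothetical values, for illustration only)
set.seed(42)
n <- 200
toy_data <- data.frame(
  sq_feet = runif(n, 400, 2000),
  dt_dist = runif(n, 0, 10)
)
toy_data$rent <- 200 + 0.7 * toy_data$sq_feet -
  50 * toy_data$dt_dist + rnorm(n, sd = 120)

model <- lm(rent ~ sq_feet + dt_dist, data = toy_data)

# Standard error of the regression, taken from the summary object
se_reg <- summary(model)$sigma

# The sigma() function from the stats package returns the same value
se_reg_alt <- sigma(model)

# 95% prediction interval for one hypothetical new observation
new_home <- data.frame(sq_feet = 1000, dt_dist = 2)
pred <- predict(model, newdata = new_home,
                interval = "prediction", level = 0.95)
pred
```

The result has three columns: `fit` (the predicted value), `lwr`, and `upr` (the lower and upper bounds of the interval).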
Excel
In Excel, if you run the =LINEST() function to estimate
a regression model with its stats argument set to TRUE, the value in the
second column of the third row will be the standard error of the regression.